The SYSTERS protein family database: Taxon-related protein family size distributions and singleton frequencies
Abstract
Based on the SYSTERS protein family database, we present taxon-related protein family frequencies and distributions. A set of taxon-related protein families is the subset of the whole family set restricted to one taxon, where a taxon is not limited to the species level but may be of any rank in the taxonomy. We examine eight ranks in the lineages of seven organisms. A strong linear correlation is observed between the total number of different families and the number of sequences in the data set under consideration. We fitted the generalised power-law function to protein family distributions in a least-squares sense, excluding singleton frequencies. Taxon-related family distributions tend to have the same shape, with a negative slope not larger than -2.1 for large data sets. For smaller data sets, the slope decreases down to -3.7. Slopes of family distributions are found to increase slowly towards higher taxonomic ranks. Our observations lead to a new estimation of single-sequence cluster frequencies. Data sets of various species are studied with respect to being complete or incomplete.

Introduction

The determination of protein families has been of interest since scientists began to analyse proteins. Several sequence-based clustering methods have been proposed. A review of these methods is presented by Heger and Holm in [6]. Other concepts consider protein architecture and structural features, i.e., folds or domains. Some databases, like the SCOP [13] and the CATH [15] classification systems, provide a combination of both sequence-based and structure-based approaches. Estimates of family frequencies have been reported for structure-based data sets [3, 14]. A recent enumeration of protein domain families close to both clustering concepts is published in [7]. Discrete protein family distributions have been described by power-law or generalised power-law function fits, as reviewed in [9].
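To make the idea of fitting a family size distribution while excluding singletons concrete, here is a minimal sketch. It rests on several assumptions that are not taken from the paper: it fits a plain power law f(s) = a * s^b rather than the generalised form, it performs the least-squares fit on log-transformed values, and the function name `fit_power_law` and the toy distribution are hypothetical.

```python
import math


def fit_power_law(distribution, exclude_singletons=True):
    """Fit log10(freq) = intercept + slope * log10(size) by least squares.

    distribution maps a family size to the number of families of that size.
    Assumption: a plain power law fitted in log-log space; the generalised
    power law used in the paper has an additional parameter not modelled here.
    """
    points = [
        (math.log10(size), math.log10(freq))
        for size, freq in distribution.items()
        if freq > 0 and not (exclude_singletons and size == 1)
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two family sizes to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: an exact power law with slope -2.5 ...
toy = {size: 10000.0 * size ** -2.5 for size in range(1, 51)}
# ... whose singleton count is artificially inflated, as a clustering artefact might do.
toy[1] *= 3

slope, intercept = fit_power_law(toy)
# Value of the fitted line at size 1 (log10(1) = 0), for illustration only.
extrapolated_at_size_one = 10 ** intercept
```

Because size 1 is left out of the fit, the inflated singleton count does not distort the slope. Reading the fitted line at size 1 is only one conceivable way to obtain a singleton estimate and should not be taken as the estimator proposed in the paper.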
Analysing distributions resulting from sequence-based methods, Huynen and van Nimwegen [8] examine small data sets, while Unger et al. [17] compare large-scale data sets. Several approaches focus on the (complete) genome or proteome of organisms. Others include (incomplete) data sets from various sources and different species. Accordingly, protein families are either built within one species or cover data from various species. The SYSTERS protein family database is an automatically generated partitioning of all publicly available protein sequences into family and superfamily clusters [11]. Including all sequences from various species makes it possible to analyse higher taxonomic levels (ranks) as well and allows for an extended view on the evolution of protein families.

| Data set | Superkingdom | Kingdom | Phylum | Class | Order | Family | Genus | Species |
|---|---|---|---|---|---|---|---|---|
| incomplete | Eukaryota | Metazoa | Chordata | Mammalia | Primates | Hominidae | Homo | Hs |
| incomplete | Eukaryota | Metazoa | Chordata | Mammalia | Rodentia | Muridae | Mus | Mm |
| incomplete | Eukaryota | Viridiplantae | Embryophyta | (eudicotyledons) | Brassicales | Brassicaceae | Arabidopsis | At |
| complete | Eukaryota | Metazoa | Arthropoda | Insecta | Diptera | Drosophilidae | Drosophila | Dm |
| complete | Eukaryota | Metazoa | Nematoda | Chromadorea | Rhabditida | Rhabditidae | Caenorhabditis | Ce |
| complete | Eukaryota | Fungi | Ascomycota | Saccharomycetes | Saccharomycetales | Saccharomycetaceae | Saccharomyces | Sc |
| complete | Bacteria | - | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Escherichia | Ec |

Table 1. Total cluster frequencies and total numbers of sequences from 46 taxa, following the lineages of seven organisms, H. sapiens (Hs), M. musculus (Mm), A. thaliana (At), D. melanogaster (Dm), C. elegans (Ce), S. cerevisiae (Sc), and E. coli (Ec), for up to eight ranks. A kingdom rank for Ec is not available; the taxon eudicotyledons corresponds to a class rank for the plant At and is therefore in parentheses. Sequence numbers are in parentheses.
We compared the taxon-related protein family frequencies and distributions for eight ranks in the lineages of seven organisms. Singleton protein families are found to be the most abundant family size in all distributions. Their biological relevance has to be assessed carefully [4], since most of them are artefacts of the underlying clustering method. The SYSTERS Release 3 database consists of 82,449 disjoint family clusters, 55,181 of which are single-sequence clusters. Sequences ending up in these clusters are mostly fragmentary and short. Although database search methods take the length of a sequence into account, shorter sequences on average yield worse E-values than longer sequences when used as query sequences. We present a new estimation of single-sequence cluster frequencies based on protein family distributions.

The paper is organised as follows: We start with a short recapitulation of the methods and data sets used in the SYSTERS database. The next paragraph covers the taxonomic basics and the data selected for our approach, followed by a description of the methods used to calculate family frequencies and distributions. We then report on the results obtained by the analysis of 46 individual taxa at different taxonomic levels.

[Figure 1 plot. x-axis: Number of Taxon-related Sequences; y-axis: Taxon-related Cluster Frequency; fitted line: y(x) = 0.2670 x + 2850 = (1 / 3.7452) x + 2850, R = 0.9765]

Figure 1. Total cluster frequency and total number of sequences related to 46 taxa. Frequencies, numbers and related taxa as given in Table 1. Slope, intercept and correlation coefficient R according to a least-squares fit.

Methods

SYSTERS Protein Family Database. The SYSTERS database provides an automatically generated grouping of all publicly available protein sequences into disjoint superfamily and family clusters [10].
The underlying redundant sequence set contains sequences from the SWISS-PROT/TrEMBL [2] and the PIR [20] databases as well as from several completely sequenced organisms, e.g., worm [5], fly [16], and yeast [18]. The data set was compiled in July 2000. Sequences that are identical or nearly identical to another sequence, above a fixed identity threshold over a fixed minimum fraction of their entire length, were considered redundant and were removed from the initial sequence set. All results in this paper refer to the resulting non-redundant data set. The classification of this set of sequences into the SYSTERS cluster set is mainly based on a traditional database search tool [1] and is done in two steps, a similarity searching step and a clustering step. First, each sequence in the database is searched against the whole sequence database down to a weak E-value cutoff. Then, a series of graph-based clustering methods is applied to these pairwise E-values.

Taxonomy and chosen Taxa. The taxonomy systematically sorts organisms into biologically meaningful groups. A continually curated data set is provided by the NCBI taxonomy [19]. Scientific names, taxonomic identification numbers (TaxIDs), lineages, and ranks are obtained from this source and are used in the SYSTERS database. A taxon is the systematic entity of the taxonomy, covering organisms as leaves and groups of organisms as internal nodes of the taxonomic tree. A lineage is the consecutive listing of all taxa an organism belongs to. From the SYSTERS taxonomy web interface (http://systers.molgen.mpg.de), cluster sets were obtained for 46 taxa.
Eight ranks (superkingdom, kingdom, phylum, class, order, family, genus, and species) were chosen along the lineages of the organisms Homo sapiens (Hs), Mus musculus (Mm), Arabidopsis thaliana (At), Drosophila melanogaster (Dm; complete data set), Caenorhabditis elegans (Ce; complete), Saccharomyces cerevisiae (Sc; complete), and Escherichia coli (Ec; complete).

Taxon-related Cluster Sets and Cluster Frequencies. Taxon-related cluster sets are extracted as follows: From the whole SYSTERS cluster set, we remove all sequences not belonging to the taxon under consideration. The remaining non-empty clusters form a cluster set related only to this taxon. If we choose, for example, the taxon Chordata, we remove all sequence entries of non-chordate organisms from the whole SYSTERS cluster set.

[Figure panels: Eukaryota, Metazoa, Chordata]
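The extraction of a taxon-related cluster set described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SYSTERS implementation: the in-memory data layout, the function names `taxon_related_clusters` and `family_size_distribution`, and the toy clusters are all hypothetical, and the lineages are abbreviated to a few of the ranks listed in Table 1.

```python
from collections import Counter


def taxon_related_clusters(clusters, lineage_of, taxon):
    """Keep only sequences whose organism belongs to `taxon`; drop emptied clusters.

    clusters   maps a cluster id to a list of (sequence_id, organism) pairs.
    lineage_of maps an organism to the set of taxa in its lineage.
    Both structures are hypothetical stand-ins for the database content.
    """
    related = {}
    for cluster_id, members in clusters.items():
        kept = [seq for seq, organism in members if taxon in lineage_of[organism]]
        if kept:  # only non-empty clusters remain in the taxon-related set
            related[cluster_id] = kept
    return related


def family_size_distribution(cluster_set):
    """Count how many clusters exist for each cluster size."""
    return Counter(len(members) for members in cluster_set.values())


# Abbreviated lineages (subset of the ranks in Table 1).
lineage_of = {
    "Homo sapiens": {"Eukaryota", "Metazoa", "Chordata", "Mammalia"},
    "Mus musculus": {"Eukaryota", "Metazoa", "Chordata", "Mammalia"},
    "Drosophila melanogaster": {"Eukaryota", "Metazoa", "Arthropoda", "Insecta"},
}

# Hypothetical toy cluster set.
clusters = {
    "c1": [("s1", "Homo sapiens"), ("s2", "Mus musculus"), ("s3", "Drosophila melanogaster")],
    "c2": [("s4", "Drosophila melanogaster")],
    "c3": [("s5", "Homo sapiens")],
}

chordata = taxon_related_clusters(clusters, lineage_of, "Chordata")
cluster_frequency = len(chordata)                        # number of taxon-related clusters
sequence_count = sum(len(m) for m in chordata.values())  # number of taxon-related sequences
distribution = family_size_distribution(chordata)
```

In this toy case the fly-only cluster `c2` vanishes for Chordata, and `c1` shrinks from three sequences to two, which shows why the same cluster can have a different size in each taxon-related set.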
Similar resources
The SYSTERS Protein Family Web Server: Shortcut from large-scale sequence information to phylogenetic information SYSTERS superfamily 114462 comprises most of the Cation efflux domain proteins in Arabidopsis thaliana
With this poster [11], we present the SYSTERS protein family database, an attempt to classify all available protein sequences. In particular, we focus on the capability of the web interface to assist in in-depth analyses of special protein families. We demonstrate this by an analysis of a specific family of transmembraneous metal ion transport proteins characterised by the so called cation effl...
SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein
We have integrated the protein families from SYSTERS and the expressed sequence tag (EST) clusters from our database GeneNest with SpliceNest, a new database mapping EST contigs into genomic DNA. The SYSTERS protein sequence cluster set provides an automatically generated classification of all sequences of the SWISS-PROT, TrEMBL and PIR databases into disjoint protein family and superfamily clu...
The SYSTERS Protein Family Database in 2005
The SYSTERS project aims to provide a meaningful partitioning of the whole protein sequence space by a fully automatic procedure. A refined two-step algorithm assigns each protein to a family and a superfamily. The sequence data underlying SYSTERS release 4 now comprise several protein sequence databases derived from completely sequenced genomes (ENSEMBL, TAIR, SGD and GeneDB), in addition to t...
WWW access to the SYSTERS protein sequence cluster set
SUMMARY We present a Web server where the SYSTERS cluster set of the non-redundant protein database consisting of sequences from SWISS-PROT and PIR is being made available for querying and browsing. The cluster set can be searched with a new sequence using the SSMAL search tool. Additionally, a multiple alignment is generated for each cluster and annotated with domain information from the Pfam ...
MetaFam: a unified classification of protein families. I. Overview and statistics
MOTIVATION Protein sequence classification is becoming an increasingly important means of organizing the voluminous data produced by large-scale genome sequencing projects. At present, there are several independent classification methods. To aid the general classification effort, we have created a unified protein family resource, MetaFam. MetaFam is a protein family classification built upon 10...
Publication date: 2003